Implement and manage end-to-end observability using Datadog, covering infrastructure, applications, and services across on-premises, cloud, and hybrid environments.
Deploy and configure monitoring agents and instrument applications with telemetry frameworks (e.g., OpenTelemetry) for comprehensive monitoring and data collection.
Design, deploy, and maintain service-level monitoring solutions including dashboards, alerts, SLA/SLO definitions, and anomaly detection with actionable notifications.
Integrate Datadog with third-party tools such as ServiceNow and other ITSM platforms, as well as SSO providers, to streamline incident management and operational workflows.
Build and maintain observability platforms that aggregate logs, metrics, and traces to provide actionable insights into system health and performance.
Collaborate closely with development, SRE, and DevOps teams to embed observability best practices aligned with business and operational goals.
Automate monitoring and telemetry configurations using Infrastructure as Code (IaC) tools such as Terraform and Ansible.
Develop scripts and automation workflows primarily in Python to optimize cloud agent instrumentation and operational tasks.
Enforce security best practices and vulnerability management across observability frameworks.
Required Skills & Experience
Minimum 2 years of experience working with cloud-based observability solutions focusing on monitoring, logging, and tracing across AWS, Azure, and GCP environments.
Expertise with Datadog, including Datadog Fundamentals, APM & Distributed Tracing Fundamentals, and Datadog Demo certifications (mandatory).
Proficient in Python programming for automation and scripting.
Hands-on experience with AWS cloud services and observability tools such as Prometheus, Grafana, Splunk, ELK Stack, and AWS CloudWatch.
Strong understanding of observability concepts (logs, metrics, and traces) and their implementation in production environments.
Experience integrating monitoring tools with ITSM platforms like ServiceNow and enabling SSO.
Familiarity with Infrastructure as Code (Terraform, Ansible) for automating infrastructure and monitoring setups.
Solid background in system and software engineering practices, CI/CD pipelines (e.g., Jenkins), and container orchestration platforms such as Kubernetes.
Demonstrated knowledge of security best practices and vulnerability management in monitoring environments.
Prior banking or financial services experience is required.